Binary Response Models

Author
Affiliation

Dave Clark

Binghamton University

Published

August 6, 2024

Binary variables

Conceive of two types of binary variables:

  • dummy variables
  • binomials - variables that can only take on 0/1 values and do not represent an underlying continuum

Linear Probability Model

The LPM is the OLS model estimated with a binary dependent variable.

This is generally frowned upon.

Linear model

The model we’ve worked with so far is

\[y_i=F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 \ldots + \beta_k X_k + \epsilon_i) \]

where \(F\) is a linear function, so

\[\hat{y_i}=\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2 \ldots + \hat{\beta_k} X_k \nonumber\\ \nonumber\\ =x\hat{\beta} \]

\(F\) is linear, so the model is linear in parameters. \(x\hat{\beta}\) is the linear prediction and is the quantity of interest, the conditional expected value of \(y\).

Now, consider a situation where \(y\) is binary, such that,

\[ y = \left\{ \begin{array}{ll} 1 \\ 0 \end{array} \right. \]

and

\[ y^* = \left\{ \begin{array}{ll} \pi_{i}\\ 1-\pi_{i} \end{array} \right. \]

where \(y_i^*\) is what we wish we could measure (say, the probability \(y_i=1\)), though we can only measure \(y_i\). Now, \(y^*\) is going to be our principal quantity of interest.

Suppose the regression model

\[y_i =F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 \ldots + \beta_k X_k + \epsilon_i) \]

where \(F\) is a nonlinear function relating the linear prediction, \(x\beta\), to \(y_i^*\).

\[\widehat{y^*} = F(x\widehat{\beta})=F(\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2 \ldots + \hat{\beta_k} X_k ) \]

\(F\) is a nonlinear function, so the model is nonlinear in parameters. \(x\hat{\beta}\) is the linear prediction, but it is not the quantity of interest. Instead, the quantity of interest is \(F(x\hat{\beta})\), which is our estimate of \(y_i^*\).

Important Concept

The difference between this model and the OLS linear model is simply that we must transform the linear prediction, \(x\hat{\beta}\), by \(F\) in order to produce predictions. Put differently, we want to map our linear prediction, \(x\beta\), onto \(y^*\).
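To make the mapping concrete, here is a minimal numeric sketch (in Python rather than R, with made-up coefficients that are not estimates from any model in these notes), using the logistic CDF as one common choice of \(F\):

```python
import numpy as np
from scipy.special import expit  # logistic CDF, one common choice of F

# hypothetical coefficients and a few covariate profiles (illustrative values only)
beta = np.array([-1.0, 0.5, 0.25])          # beta_0, beta_1, beta_2
X = np.array([[1.0, 2.0, -1.0],
              [1.0, 0.0,  4.0],
              [1.0, 6.0,  2.0]])            # each row: 1, x1, x2

xb = X @ beta                               # linear predictions, unbounded
y_star = expit(xb)                          # F(xb): mapped into (0, 1)

print(xb)      # can fall anywhere on the real line
print(y_star)  # always strictly between 0 and 1
```

Whatever values \(x\widehat{\beta}\) takes on the real line, \(F(x\widehat{\beta})\) is always a valid probability.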

Why leave the robust OLS model?

Why not use the linear model when \(y\) is binary?

  • because we fail to satisfy the OLS assumptions.
  • because the residuals are not Normal.
  • because \(y\) is not Normal.
  • because \(y\) is limited.

Limited \(y\) variables are \(y\)s where our measurement is limited by the realities of the world. Such variables are rarely normal, often not continuous, and often observable indicators of unobservable things - this is all true of binary variables.

Limited Dependent Variables

Why would we measure \(y_i\) rather than \(y_i^*\)?

Limited dependent variables are usually limited in the sense that we cannot observe the range of the variable or the characteristic of the variable we want to observe. We are limited to observing \(y_i\), and so must estimate \(y_i^*\).

Examples of Limited DVs

  • binary variables: 0=peace, 1=war; 0=vote, 1=don’t vote.

  • unordered or nominal categorical variables: type of car you prefer: Honda, Toyota, Ford, Buick; policy choices; consumer choices.

  • ordered variables that take on few values: some survey responses.

  • discrete count variables: number of episodes of scarring torture in a country-year, 0, 1, 2, 3, …, \(\infty\)

  • time to failure; how long a civil war lasts; how long a patient survives disease; how long a leader survives in office.

Binary dependent variables

Generally, we conceive of a binary variable as being the observable manifestation of some underlying, latent, unobserved continuous variable.

If we could adequately observe (and measure) the underlying continuous variable, we’d use some form of OLS regression to analyze that variable.

Why not use OLS?

\[ \mathbf{y}=\mathbf{X \beta} + \mathbf{u} \]

where we are principally interested in the conditional expectation of \(y\), \(E(y_{i}|\mathbf{x_{i}})\) where we want to interpret that expectation as a conditional probability, \(Pr(y=1|\mathbf{x_{i}})\); we focus on the probability the outcome occurs (i.e., \(y\) is equal to one).

Linear Probability Model

The linear probability model (LPM) is the OLS linear regression with a binary dependent variable.

The main justification for the LPM is that OLS is unbiased (by Gauss-Markov). But \(\ldots\)

  • predictions are nonsensical (linear, unbounded, measures of \(\hat{y}\) rather than \(y^*\)).

  • disturbances are non-normal, and heteroskedastic.

  • relation or mapping of \(x\beta\) and \(y\) are the wrong functional form (linear).

Running example - Democratic Peace data

As a running example, I’ll use the Democratic Peace data to estimate logit and probit models. These come from Oneal and Russett’s (1997) well-known study in ISQ. The units are dyad-years; the \(y\) variable is the presence or absence of a militarized dispute, and the \(x\) variables include a measure of democracy (the lowest of the two Polity scores in the dyad) and a set of controls.

Predictions out of bounds

code
library(haven)    # for read_dta()
library(ggplot2)  # the plots below assume ggplot2 is loaded

dp <- read_dta("/Users/dave/Documents/teaching/501/2023/slides/L7_limiteddv/code/dp.dta")

m1 <-glm(dispute ~ border+deml+caprat+ally, family=binomial(link="logit"), data=dp )
logitpreds <- predict(m1, type="response")

mols <-lm(dispute ~ border+deml+caprat+ally, data=dp )
olspreds <- predict(mols)

df <- data.frame(logitpreds, olspreds, dispute=as.factor(dp$dispute))

ggplot(df, aes(x=logitpreds, y=olspreds, color=dispute)) + 
  geom_point()+
  labs(title="Predictions from Logit and OLS", x="Logit Predictions", y="OLS Predictions")+
  geom_hline(yintercept=0)+
  theme_minimal() +
  annotate("text", x=.05, y=-.05, label="2,147 Predictions out of bounds", color="red")

code
ggplot(df, aes(x=olspreds)) + 
  geom_density(alpha=.5)+
  labs(title="Density of OLS Predictions", x="Predictions", y="Density")+
  theme_minimal()+
geom_vline(xintercept=0, linetype="dashed")

Heteroskedastic Residuals

code
df <- data.frame(df, mols$residuals)
 
ggplot(df, aes(x=mols.residuals, color=dispute)) + 
  geom_density()+
  labs(title="Density of OLS Residuals", x="Residuals", y="Density")+
  theme_minimal()+
  geom_vline(xintercept=0, linetype="dashed")

When is the LPM Reasonable?

On Linearity

In the linear model, \(\hat{y_i}=x_i\beta\). This makes sense because \(y = y^*\). Put differently, \(y\) is continuous, unbounded, (assumed) normal, and is an “unlimited” measure of the concept we intend to measure.

In binary models, \(y \neq y^*\), because our observation of \(y\) is limited such that we can only observe its presence or absence. We have two different realizations of the same variable: \(y\) is the limited but observed variable; \(y^*\) is the unlimited variable we want to measure, but cannot because it is unobservable.

The goal of these models is to use \(y\) in the regression in order to get estimates of \(y^*\). Those estimates of \(y^*\) are our principal quantity of interest in the binary variable model.

Linking \(x\widehat{\beta}\) and \(y^*\)

We can produce the linear prediction, \(x\widehat{\beta}\), but we need to transform it to produce estimates of \(y^*\). To do so, we use a link function to map \(x_i\beta\) onto the probability space, \(y^*\). This means \(\widehat{y_i} \neq x\widehat{\beta}\). Instead,

\[y^* = F(x_i\beta)\]

Where \(F\) is a continuous, sigmoid probability CDF. This is how we get estimates of our quantity of interest, \(y^*\).

Non-linear change in Pr(y=1) across values of \(x\)

In the LPM, the relationship between \(Pr(y=1)\) and \(X\) is linear, so the rate of change toward \(Pr(y=1)\) is constant across all values of \(X\).

This means that the rate of change approaching one (or approaching zero) is exactly the same as the rate of change anywhere else in the distribution.

For example, this means that the change from .99 to 1.00 is just as likely as the change from .50 to .51; is this sensible for a bounded latent variable (probability)?
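A quick numerical check (Python, using the logistic CDF) shows the sigmoid's rate of change is not constant: the same one-unit step in \(z\) moves \(Pr(y=1)\) far more near the middle of the curve than out in the tail:

```python
from scipy.special import expit  # logistic CDF

# change in Pr(y=1) for the same one-unit step in z, taken at different points
mid_change = expit(0.5) - expit(-0.5)   # step through the middle of the curve
tail_change = expit(4.5) - expit(3.5)   # same-sized step out in the tail

print(mid_change, tail_change)  # the middle step moves the probability far more
```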

Linear and Sigmoid Functions at Limits

code
z <- seq(-5,5,.1)
l <- seq(0,1,.01)
s1 <- 1/(1+exp(-z))
s2 <- pnorm(z)

ggplot() + 
  geom_line(aes(x=z, y=l), color="black")+
  geom_line(aes(x=z, y=s1), color="red")+
  geom_line(aes(x=z, y=s2), color="green")+
  labs(title="Linear and Sigmoid Functions", x="z", y="F(z)")+
  theme_minimal()+
  annotate("text", x=0, y=.75, label="Normal", color="green")+
  annotate("text", x=-3, y=.1, label="Logistic", color="red")

Non-constant change in Pr(y=1) across values of \(z\)

Animating the change in \(Pr(y=1)\) across values of \(z\) for the linear and sigmoid functions.

code
library(gganimate)

z <- seq(-5,5,.1)
l <- seq(0,1,.01)
s1 <- 1/(1+exp(-z))
s2 <- pnorm(z)

df <- data.frame(z=z, l=l, s1=s1, s2=s2)

ggplot(df, aes(x=z, y=l)) + 
  geom_line(aes(x=z, y=l), color="black")+
  geom_line(aes(x=z, y=s1), color="red")+
  geom_line(aes(x=z, y=s2), color="green")+
  labs(title="Rates of change", x="z", y="F(z)")+
  theme_minimal()+
  transition_reveal(z)

The residuals are not normally distributed (except asymptotically). Suppose \[\begin{aligned} y_{i}=1;~~~ u_{i} = 1-x_{i} \hat{\beta} \nonumber \\ y_{i}=0;~~~ u_{i} = -x_{i} \hat{\beta} \nonumber \end{aligned}\]

The disturbance term, \(u_{i}\) only takes on two values (just like \(y_{i}\)); it follows the binomial rather than the normal distribution. This is not necessarily that serious a problem since the OLS estimates will still be unbiased.


The disturbances are heteroskedastic. Because the conditional expected value of \(Y\) is equivalent to the conditional probability that \(y=1\) (\(E(y_{i}|\mathbf{x_{i}})=Pr(y=1|\mathbf{x_{i}})\)), the variance of the disturbance term, \(u\), is \[\begin{aligned} var(u_{i}) = E(Y_{i}|X_{i})[1-E(Y_{i}|X_{i})] \nonumber\\ =p_{i}(1-p_{i}) \nonumber \end{aligned}\]

So the variance of the disturbance term depends explicitly on the conditional expectation of \(Y\), which is conditional on \(X\). Put another way, the variance of \(u\) depends on the independent variables, and so is neither homoskedastic nor independent of the \(X\)s, nor likely of \(var(u_{j})\).


Predictions

The predictions of \(Y\) (the conditional expectation \(E(y_{i}|\mathbf{x_{i}})\)) are not necessarily bounded by zero and one: \(0 \leq E(y_{i}|\mathbf{x_{i}}) \leq 1\) is not always fulfilled.

Linearity doesn’t seem like the right functional form.


Examples

This week’s video runs through some examples of these problems.

Why move to ML? Lipstick on a pig …

OLS is a rockin’ estimator, but it’s just not well suited to limited \(y\) variables. Efforts to rehabilitate the LPM are like putting lipstick on a pig.

Deriving an LLF from the ground up

So let’s build a model for a binary \(y\) variable.

  • Observe \(y\), consider its distribution, write the PDF.
  • Write the joint probability of the data, using the chosen probability distribution.
  • Write the joint probability as a likelihood:
  • Simplify - take logs, etc.
  • Parameterize
  • Write in the link function, linking the systematic component of the model to the latent variable, \(\tilde{y}\).

A nonlinear model for binary data

So \(y\) is binary, and we’ve established the linear model is not appropriate. The observed variable, \(y\), appears to be binomial (iid):

\[ y \sim f_{binomial}(\pi_i)\]

\[ y = \left\{ \begin{array}{ll} 1, & \pi_i \\ 0, & 1-\pi_i \end{array} \right. \]

\[ \pi_i = F(x_i\widehat{\beta}) \] \[1- \pi_i=1-F(x_i\widehat{\beta})\]


Binomial LLF

Write the PDF:

\[ Pr(y_i | \pi_i) = \pi_i^{y_i} (1-\pi_i)^{1-y_i} \]

Write the joint probability as a likelihood:

\[\mathcal{L} (\pi |\ y) = \prod \limits_{i=1}^{n} \left[ \pi_i^{y_i} (1-\pi_i)^{1-y_i}\right]\]

Write the log-likelihood:

\[\ln \mathcal{L} (\pi| \ y) = \sum \limits_{i=1}^{n} \left[ y_i \ln ( \pi_i) + (1-y_i) \ln(1-\pi_i)\right]\]

Parameterize the Distribution parameter \(\pi_i\)

Parameterize \(\pi_i\):

\[\pi_i= F(x \beta)\]

Substituting \(F\) into the log-likelihood gives the parameterized binomial log-likelihood function:

\[\ln \mathcal{L} (\pi| \ y) = \sum \limits_{i=1}^{n} \left[ y_i \ln (F(x_i\widehat{\beta})) + (1-y_i) \ln(1-F(x_i\widehat{\beta}))\right]\]

But we need to fill in \(F\), the link function.

Probit and Logit LLFs

Probit - link between \(x\hat{\beta}\) and \(Pr(y=1)\) is standard normal CDF: \[ \ln \mathcal{L} (Y|\beta) = \sum_{i=1}^{N} y_i \ln \Phi(\mathbf{x_i \beta})+ (1-y_i) \ln[1-\Phi(\mathbf{x_i \beta})] \nonumber \]

Logit (logistic CDF):

\[ \ln \mathcal{L} (Y|\beta) = \sum_{i=1}^{N} \left\{ y_i \ln \left(\frac{1}{1+e^{-\mathbf{x_i \beta}}}\right)+ (1-y_i) \ln \left[1-\left(\frac{1}{1+e^{-\mathbf{x_i \beta}}}\right)\right] \right\}\nonumber \]
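The logit LLF above can be maximized numerically. A minimal sketch in Python, on simulated data with made-up coefficients (not the Democratic Peace data):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)

# simulated data with known coefficients (illustrative values only)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, expit(X @ beta_true))

def neg_loglik(beta):
    # binomial log-likelihood with the logistic CDF plugged in for F
    p = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)  # should land near beta_true
```

This is exactly what `glm(..., family=binomial(link="logit"))` does under the hood, via iteratively reweighted least squares rather than BFGS.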

Sigmoid Functions

There are many sigmoid-shaped probability functions that will satisfy these needs.

On Linearity

In the linear model, \(\hat{y_i}=x_i\beta\). This makes sense because \(y = \tilde{y}\). Put differently, \(y\) is continuous, unbounded, (assumed) normal, and is an unlimited measure of the concept we intend to measure. Also, \(x_i\beta\) is in units of \(y\), so no mapping is necessary.

In binary models, \(y \neq \tilde{y}\), because our observation of \(y\) is limited such that we can only observe its presence or absence. We have two different realizations of the same variable: \(y\) is the limited but observed variable; \(\tilde{y}\) is the unlimited variable we want to measure, but cannot because it is unobservable.

The goal of these models is to use \(y\) in the regression in order to get estimates of \(\tilde{y}\). Those estimates are our principal quantity of interest in the binary variable model.

Since \(y \neq \tilde{y}\), we use the link function to map \(x_i\beta\) onto the space of \(\tilde{y}\). This means \(\hat{y_i} \neq x_i\beta\). Instead,

\[ \tilde{y} = F(x_i\beta) \nonumber \] Thus, we get estimates of our quantity of interest, \(\tilde{y}\).


Binary response interpretation

  • Signs and significance - all the usual rules apply.
  • Quantities of interest - most commonly \(Pr(y=1|X)\); or marginal effects.
  • Measures of uncertainty (e.g. confidence intervals) are a must (as always).

Non-linear models: Predicted probabilities

In the nonlinear model, the most basic quantity is

\[F(x\widehat{\beta})\]

where \(F\) is the link function, mapping the linear prediction onto the prediction space.

Logit

\[\Lambda(x\widehat{\beta}) = \frac{exp(x\widehat{\beta})}{1+exp(x\widehat{\beta})} = \frac{1}{1+exp(-x\widehat{\beta})}\]

Probit

\[F(x\widehat{\beta}) = \Phi(x\widehat{\beta})\]

Some models have multiple quantities, e.g., count models

The Poisson event count, for instance:

Expected value of \(Y|X\) - this would be the expected number of events:

\[E[Y|X] = \widehat{\lambda} = exp(x\widehat{\beta})\]

Probability \(Y=y_i\) for each value of \(Y\), e.g., the probability of observing \(y_i=2\) events:

\[Pr(Y = y_i) = \frac{\widehat{\lambda}^{y_i} \cdot exp(-\widehat{\lambda}) }{y_i!} \]

Marginal Effects, Linear Model

In the linear model, the marginal effect of \(x_k\) is \(\widehat{\beta_k}\). That is, the effect of a one-unit change in \(x_k\) on \(y\) is \(\widehat{\beta_k}\).

\[ \frac{\partial \widehat{y}}{\partial x_k}= \frac{\partial x \widehat{\beta}}{\partial x_k} \nonumber \\ \nonumber \\ = \widehat{\beta} \nonumber \]

The marginal effect is constant with respect to \(x_k\).

Marginal Effects, Nonlinear Model

In the nonlinear model, the marginal effect of \(x_k\) depends on where \(x\widehat{\beta}\) lies with respect to the probability distribution \(F(\cdot)\).

\[ \frac{\partial Pr(y=1)}{\partial x_k}= \frac{\partial F(x\widehat{\beta})}{\partial x_k} \nonumber \\ \nonumber \\ = \frac{\partial F(x\widehat{\beta})}{\partial x\widehat{\beta}} \cdot \frac{\partial (x\widehat{\beta})}{\partial x_k} \nonumber \]

Both of these terms simplify …

Remember that

\[ \frac{\partial (x\widehat{\beta})}{\partial x} = \widehat{\beta} \nonumber \]

and \[ \frac{\partial F(x\widehat{\beta})}{\partial x\widehat{\beta}} = f(x\widehat{\beta}) \nonumber \]

where the derivative of the CDF is the PDF.

Putting these together gives us:

\[ \frac{\partial F(x\widehat{\beta})}{\partial x_k} = f(x\widehat{\beta}) \widehat{\beta} \nonumber \]

This is \(\widehat{\beta}\) weighted by or measured at the ordinate on the PDF - the ordinate is the height of the PDF associated with a value of the \(x\) axis (an abscissa).

Logit Marginal Effects

For logit,

\[ \frac{\partial \Lambda(x\widehat{\beta})}{\partial x_k} = \lambda(x\widehat{\beta}) \widehat{\beta} \nonumber \]

Recall that \(\Lambda\) is the logistic CDF (\(1/(1+exp(-x_i\widehat{\beta}))\)), and \(\lambda\) is the logistic PDF (\(exp(-x_i\widehat{\beta})/(1+exp(-x_i\widehat{\beta}))^2\)).

Also, remember that

\[\frac{e^{x_i\widehat{\beta}}}{1+e^{x_i\widehat{\beta}}} = \frac{1}{1+e^{-x_i\widehat{\beta}}}\]

Logit Marginal Effect

\[ \begin{align} \frac{\partial \Lambda(x\widehat{\beta})}{\partial x_k} = \lambda(x\widehat{\beta}) \widehat{\beta} \\ = \frac{e^{x_i\widehat{\beta}}}{(1+e^{x_i\widehat{\beta}})^2} \widehat{\beta} \\ =\frac{e^{x_i\widehat{\beta}}}{1+e^{x_i\widehat{\beta}}} \cdot \frac{1}{1+e^{x_i\widehat{\beta}}} \widehat{\beta} \\ =\Lambda(x_i\widehat{\beta}) \frac{1+e^{x_i\widehat{\beta}}-e^{x_i\widehat{\beta}}}{1+e^{x_i\widehat{\beta}}} \widehat{\beta} \\ =\Lambda(x_i\widehat{\beta}) \left(1-\frac{e^{x_i\widehat{\beta}}}{1+e^{x_i\widehat{\beta}}}\right) \widehat{\beta} \\ =\Lambda(x_i\widehat{\beta}) (1-\Lambda(x_i\widehat{\beta})) \widehat{\beta} \end{align} \]

Logit Marginal Effect (Maximum)

This is cool because

\[\frac{\partial \Lambda(x\widehat{\beta})}{\partial x_k} = \lambda(x\widehat{\beta}) \widehat{\beta}\\ =\Lambda(x_i\widehat{\beta}) (1-\Lambda(x_i\widehat{\beta})) \widehat{\beta} \]

means the marginal effect is \(\widehat{\beta}\) weighted by the probability that \(y=1\) times the probability that \(y=0\). That product, \(\Lambda(1-\Lambda)\), is largest when \(\Lambda(x_i\widehat{\beta})=0.5\), where \(0.5 \cdot 0.5 = 0.25\), so the maximum marginal effect is \(0.25 \widehat{\beta}\).

Visualizing Logit Marginal Effects

code
z <- seq(-5,5,.1)
p <- plogis(z)
d <- dlogis(z)

df <- data.frame(z=z, p=p, d=d)

#plot pdf and cdf with reference line at y=.25
ggplot(df, aes(x=z)) + 
  geom_line(aes(y=d), color="black")+
  geom_line(aes(y=p), color="red")+
  geom_hline(yintercept=.25, linetype="dashed")+
  labs(title="Logistic PDF and CDF", x="z", y="F(z)")+
  theme_minimal()

Probit Marginal Effects (Maximum)

In the probit model, this is simply

\[ \frac{\partial \Phi(x\widehat{\beta})}{\partial x_k} = \phi(x\widehat{\beta}) \widehat{\beta} \nonumber \]

The ordinate at the maximum of the standard normal PDF is 0.3989 - rounding to 0.4, we can say that the maximum marginal effect of any \(\widehat{\beta}\) in the probit model is \(0.4\widehat{\beta}\).

The ordinate is at its maximum where \(z=0\); recall this is the standard normal, so \(x_i\widehat{\beta}=z\). When \(z=0\),

\[\phi(z)=\frac{1}{\sqrt{2 \pi}} \exp \left[\frac{-z^{2}}{2}\right] \nonumber \\ \nonumber\\ =\frac{1}{\sqrt{2 \pi}} \nonumber\\ \approx .4 \nonumber \]
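Both maximum ordinates are easy to verify numerically (a quick Python check):

```python
import math
from scipy.stats import logistic, norm

# ordinates at z = 0, where each PDF peaks
print(logistic.pdf(0))             # 0.25: max logit marginal effect is 0.25 * beta-hat
print(norm.pdf(0))                 # ~0.3989: max probit marginal effect is ~0.4 * beta-hat
print(1 / math.sqrt(2 * math.pi))  # the same 1/sqrt(2*pi)
```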

Visualizing Probit Marginal Effects

code
z <- seq(-5,5,.1)
p <- pnorm(z)
d <- dnorm(z)

df <- data.frame(z=z, p=p, d=d)

#plot pdf and cdf with reference line at y=.39
ggplot(df, aes(x=z)) + 
  geom_line(aes(y=d), color="black")+
  geom_line(aes(y=p), color="red")+
  geom_hline(yintercept=.3989, linetype="dashed")+
  labs(title="Standard Normal PDF and CDF", x="z", y="F(z)")+
  theme_minimal()

Marginal Effects in the Nonlinear Model

code
z <- seq(-5,5,.1)
ncdf <- pnorm(z)
npdf <- dnorm(z)
lcdf <- plogis(z)
lpdf <- dlogis(z)

df <- data.frame(ncdf=ncdf, npdf=npdf, lcdf=lcdf, lpdf=lpdf, z=z)


highchart() %>%
  hc_add_series(df, "line", hcaes(x = z, y = npdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = ncdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = lpdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = lcdf)) %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_xAxis(title = list(text = "z")) %>%
  hc_yAxis(title = list(text = "F(z)")) %>%
  hc_legend(enabled = FALSE)

Logit Odds Interpretation

The odds are given by the probability an event occurs divided by the probability it does not:

\[ \Omega(X) = \frac{Pr(y=1)}{1-Pr(y=1)} \nonumber = \frac{\Lambda(X\widehat{\beta})}{(1-\Lambda(X\widehat{\beta}))} \nonumber \]

Logit Log-odds

Logging …

\[\ln \Omega(X) = \ln \left(\frac{\Lambda(X\widehat{\beta})}{(1-\Lambda(X\widehat{\beta}))}\right) =X\widehat{\beta} \]

\[ \frac{\partial \ln \Omega}{\partial X} = \widehat{\beta} \nonumber \]

Which shows the change in the log-odds given a change in \(X\) is constant (and therefore linear). This quantity is sometimes called “the logit.”

Logit Odds Ratios

Odds ratios are very useful:

\[ \frac{ \Omega (x_k + 1)}{\Omega (x_k)} =exp(\widehat{\beta_k}) \nonumber \]

comparing the difference in odds between two values of \(x_k\); note the change in value does not have to be 1.

\[ \frac{ \Omega (x_k + \iota)}{\Omega (x_k)} =exp(\widehat{\beta_k} \cdot \iota) \nonumber \]

Logit Odds Ratios

Not only is it simple to exponentiate \(\widehat{\beta_k}\), but the interpretation is that a one-unit increase in \(x_k\) multiplies the odds that \(y=1\) by the factor \(exp(\widehat{\beta_k})\), and more usefully, that:

\[ 100*(exp(\widehat{\beta_k})-1) \nonumber \]

is the percentage change in the odds given a one unit change in \(x_k\).

So a logit coefficient of .226 gives

\[ 100*(exp(.226)-1) =25.36 \nonumber \]

a 25.36% increase in the odds of \(y\) occurring.
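Checking the arithmetic (Python):

```python
import math

beta_k = 0.226                       # the logit coefficient from the text
odds_ratio = math.exp(beta_k)        # factor change in the odds for a one-unit change in x_k
pct_change = 100 * (odds_ratio - 1)  # percentage change in the odds
print(round(odds_ratio, 4), round(pct_change, 2))
```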

Interpreting Binary Models

Probit and logit coefficients are directly interpretable in the sense that

  • We can interpret direction.
  • We can interpret statistical difference from zero.
  • We can say the largest marginal effect of \(x \approx 0.4\cdot\widehat{\beta}\) for the probit model.
  • We can say the largest marginal effect of \(x \approx 0.25\cdot\widehat{\beta}\) for the logit model.
  • We can say that \(100 \cdot (exp(\widehat{\beta_k})-1)\) is the percentage change in the odds that \(y=1\), for the logit model.

It’s still the case that we often want other quantities of interest, like probabilities, and that requires a straightforward transformation of the linear prediction, \(F(x_i\widehat{\beta})\).

Two general types of predictions

  • MEM - Marginal Effects at Means - set variables to means/medians/modes, vary \(x\) of interest, generate effect.

  • AME - Average Marginal Effects - set \(x\) of interest to value of interest, predict, average, then repeat at next value of interest.

Marginal Effects at Means (MEM)

MEMs are what they sound like - effects with independent variables set at central tendencies.

  • estimate the model.
  • create out-of-sample data: vary the \(x\) of interest; set all other \(x\) variables to appropriate central tendencies - hence the “at Means.”
  • generate QIs in the out-of-sample data.
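A minimal MEM sketch in Python, with made-up "estimated" coefficients and data standing in for a fitted logit:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(7)

# stand-in for an estimated logit model (coefficients are illustrative)
beta_hat = np.array([-1.0, 0.8, 0.5])   # intercept, x1 (of interest), x2 (control)
x2 = rng.normal(1.0, 2.0, size=1000)    # "estimation sample" for the control

# MEM: hold the control at its mean, sweep x1 over interesting values
x1_grid = np.linspace(-2, 2, 5)
profiles = np.column_stack([np.ones_like(x1_grid),
                            x1_grid,
                            np.full_like(x1_grid, x2.mean())])
mem = expit(profiles @ beta_hat)
print(mem)  # Pr(y=1) at each x1 value, control at its mean
```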

Average Marginal Effects (AME)

Average Marginal Effects are in-sample but create a counterfactual for a variable of interest, assuming the entire sample looks like that case.

For instance, suppose a model of wages with covariates for education and gender. We might ask the question what would the predictions look like if the entire sample were male, but otherwise looked as it does? Alternatively, what would the predictions look like if the entire sample were female, but all other variables the same as they appear in the estimation data?

To answer these, we’d change the gender variable to male, generate \(x{\widehat{\beta}}\) for the entire sample, and take the average, then repeat with the gender variable set to female.

Average Marginal Effects (AME)

  • estimate model.
  • in estimation data, set variable of interest to a particular value for the entire estimation sample.
  • generate QIs (expected values, standard errors).
  • take average of QIs, and save.
  • repeat for all values of variable of interest, and plot.
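The steps above can be sketched with the same hypothetical logit as before (Python, illustrative coefficients and data): set the variable of interest to each value for the whole sample, predict, and average.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(7)

# illustrative "estimated" logit and estimation data
beta_hat = np.array([-1.0, 0.8, 0.5])   # intercept, x1 (of interest), x2 (control)
n = 1000
x2 = rng.normal(1.0, 2.0, size=n)       # control, left as observed

ames = []
for x1_val in [-2, -1, 0, 1, 2]:
    # counterfactual: everyone in the sample gets this value of x1
    X_cf = np.column_stack([np.ones(n), np.full(n, x1_val), x2])
    ames.append(expit(X_cf @ beta_hat).mean())  # average prediction over the sample
print(ames)
```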

Methods for Quantities of Interest

  • direct computation - generate \(F(x\widehat{\beta})\) for interesting values of \(x\) (either as MEM or AME).
  • simulation of \(\widehat{\beta}\).
  • simulation of QI.

Uncertainty

We have two main quantities of interest - everything so far has focused on generating a predicted value from the model. Let’s think about generating measures of uncertainty for those predicted values.

This section examines ways to compute standard errors, and ways to use those to compute confidence intervals.

Uncertainty: Standard Errors of Linear Predictions

Consider the linear prediction

\[X \widehat{\beta} \]

under maximum likelihood theory:

\[var(X \widehat{\beta}) = \mathbf{X V X'} \]

an \(N \times N\) matrix, where \(V\) is the variance-covariance matrix of \({\widehat{\beta}}\). The main diagonal contains the variances of the \(N\) predictions. The standard errors are:

\[se(X \widehat{\beta}) = \sqrt{diag(\mathbf{X V X'})} \]

which is an \(N \times 1\) vector.
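A small numerical sketch (Python, with an illustrative design matrix and an illustrative vcov matrix \(V\)):

```python
import numpy as np

# illustrative design matrix and coefficient variance-covariance matrix V
X = np.array([[1.0, 2.0],
              [1.0, 0.5],
              [1.0, -1.0]])
V = np.array([[0.04, -0.01],
              [-0.01, 0.02]])       # symmetric positive definite

var_xb = X @ V @ X.T                # N x N; prediction variances on the diagonal
se_xb = np.sqrt(np.diag(var_xb))    # N x 1 vector of standard errors
print(se_xb)
```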

Uncertainty: Delta Method

The ML method is appropriate for monotonic functions of \(X \widehat{\beta}\), e.g. logit, probit. In other models (e.g., multinomial logit), the function is not monotonic in \(X \widehat{\beta}\) so we use the Delta Method - this creates a linear approximation of the function. Greene (2012: 693ff) gives this as a general derivation of the variance:

\[Var[F(X \widehat{\beta})] = f(\mathbf{x'\widehat{\beta}})^2 \mathbf{x' V x} \]

Where this would generate variances for whatever \(F(X \widehat{\beta})\) is, perhaps a predicted probability.

Uncertainty: Standard Errors of \(p\) in Logit

For the logit, the delta-method standard error of the predicted probability is given by:

\[F(X \widehat{\beta}) \cdot (1-F(X \widehat{\beta})) \cdot \sqrt{\mathbf{X V X'}}\]

\[ = f(X \widehat{\beta}) \cdot \sqrt{\mathbf{X V X'}}\]

or

\[ p \cdot (1-p) \cdot stdp\]

where \(stdp\) is the standard error of the linear prediction.

Standard Errors for Predicted Probabilities

\[ Var[Pr(Y_i = 1)] = \left[ \frac{\partial F(\mathbf{X}_i \hat{\boldsymbol{\beta}})}{\partial \hat{\boldsymbol{\beta}}} \right]' \hat{\mathbf{V}} \left[ \frac{\partial F(\mathbf{X}_i \hat{\boldsymbol{\beta}})}{\partial \hat{\boldsymbol{\beta}}} \right] \]

\[ = \left[ f(\mathbf{X}_i \hat{\boldsymbol{\beta}}) \right]^2 \mathbf{X}_i' \hat{\mathbf{V}} \mathbf{X}_i \]

\[s.e.(Pr(y=1)) = \sqrt{\left[ f(\mathbf{X}_i \hat{\boldsymbol{\beta}}) \right]^2 \mathbf{X}_i' \hat{\mathbf{V}} \mathbf{X}_i}\]
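A sketch of the delta-method standard error for a logit predicted probability (Python, illustrative numbers):

```python
import numpy as np
from scipy.stats import logistic

# one covariate profile, illustrative logit estimates and vcov matrix
x = np.array([1.0, 2.0])
beta_hat = np.array([-0.5, 0.4])
V = np.array([[0.04, -0.01],
              [-0.01, 0.02]])

xb = x @ beta_hat
stdp = np.sqrt(x @ V @ x)        # se of the linear prediction
p = logistic.cdf(xb)
se_p = logistic.pdf(xb) * stdp   # delta method: f(xb) * se(xb)
# for the logit, f = F(1-F), so this equals p*(1-p)*stdp
print(p, se_p)
```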

Uncertainty: SEs of Predictions for linear combinations

A common circumstance that requires joint hypothesis tests is the case of polynomials (which are themselves interactions):

\[y = \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_{1}^2 + \varepsilon \]

The question is whether \(\widehat{\beta}_1 = \widehat{\beta}_2 = 0\). The marginal effect is:

\[ \widehat{\beta}_1 + 2 \widehat{\beta}_2x_1\]

and requires the standard error of \(\widehat{\beta}_1 + 2\widehat{\beta}_2 x_1\), which is:

\[ \sqrt{var(\widehat{\beta}_1) + 4x_{1}^{2}var(\widehat{\beta}_2) + 4x_1 cov(\widehat{\beta}_1, \widehat{\beta}_2) }\]
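The same standard error can be computed as a quadratic form in the gradient of the marginal effect (Python, illustrative estimates and vcov):

```python
import numpy as np

# illustrative estimates for y = b0 + b1*x + b2*x^2
b1, b2 = 0.9, -0.15
V = np.array([[0.010, -0.002],
              [-0.002, 0.001]])    # vcov of (b1, b2), illustrative

x1 = 2.0
me = b1 + 2 * b2 * x1              # marginal effect at x1
grad = np.array([1.0, 2 * x1])     # d(me)/d(b1, b2)
se = np.sqrt(grad @ V @ grad)      # = sqrt(var(b1) + 4*x1^2*var(b2) + 4*x1*cov(b1,b2))
print(me, se)
```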

Uncertainty: CIs - End Point Transformation

Generate upper and lower bounds using either ML or Delta standard errors, such that

\[F(X \widehat{\beta} - c*s.e.) \leq F(X \widehat{\beta}) \leq F(X \widehat{\beta} + c* s.e.)\]

  • estimate the model, generate the linear prediction, and the standard error of the linear prediction using either ML or Delta.
  • generate linear boundary predictions, \(x{\widehat{\beta}} \pm c \cdot \text{s.e.}\), where \(c\) is a critical value on the normal, e.g., \(z=1.96\).
  • transform the linear prediction and the upper and lower boundary predictions by \(F(\cdot)\).
  • With ML standard errors, EPT boundaries will obey distributional boundaries (i.e., they won’t fall outside the 0-1 interval for probabilities); the endpoint predictions are symmetric on the linear scale, but after transformation by \(F\) they generally will not be.
  • With delta standard errors, bounds may not obey distributional boundaries.
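The steps above can be sketched directly (Python, logistic link, illustrative values for the linear prediction and its standard error):

```python
import numpy as np
from scipy.special import expit

# illustrative linear prediction and its ML standard error
xb, se = 0.3, 0.28
c = 1.96                                # normal critical value

lo, hi = expit(xb - c * se), expit(xb + c * se)
point = expit(xb)
print(lo, point, hi)  # bounds stay in (0,1); interval need not be symmetric around the point
```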

Uncertainty: Simulating confidence intervals, I

  • draw a sample with replacement of size \(\tilde{N}\) from the estimation sample.
  • estimate the model parameters in that bootstrap sample.
  • using the bootstrap estimates, generate quantities of interest (e.g., \(x\widehat{\beta}\)); repeat \(j\) times.
  • collect all these bootstrap QIs and use either percentiles or standard deviations to measure uncertainty.

Uncertainty: Simulating confidence intervals, II

  • estimate the model.
  • generate a large sample distribution of parameters (e.g. using drawnorm).
  • generate quantities of interest for the distribution of parameters.
  • use either percentiles or standard deviations of the QI to measure uncertainty.
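A sketch of this second simulation approach (Python; numpy's multivariate normal draws play the role of drawnorm, and the estimates and vcov are illustrative):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(42)

# illustrative logit estimates, vcov, and one covariate profile
beta_hat = np.array([-0.5, 0.4])
V = np.array([[0.04, -0.01],
              [-0.01, 0.02]])
x = np.array([1.0, 2.0])

# simulate the sampling distribution of beta, then of the QI
sims = rng.multivariate_normal(beta_hat, V, size=10_000)
p_sims = expit(sims @ x)                 # QI (predicted probability) for each draw
ci = np.percentile(p_sims, [2.5, 97.5])  # percentile-based confidence interval
print(p_sims.mean(), ci)
```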

References

Oneal, John R., and Bruce M. Russett. 1997. “The Classic Liberals Were Right: Democracy, Interdependence, and Conflict, 1950-1985.” International Studies Quarterly 41 (2): 267–94.